Nadiia Honcharenko (220681, nadiia.honcharenko@st.ovgu.de)
Rutuja Shivraj Pawar (220051, rutuja.pawar@ovgu.de)
Shivani Jadhav (223856, shivani.jadhav@st.ovgu.de)
Sumit Kundu (217453, sumit.kundu@st.ovgu.de)
M.Sc. Uli Niemann
A customer is a key-centric factor for any business to be successful. Conventional wisdom tells us that the cost of retaining an existing customer is far less than acquiring a new one. In order that a business has a sustainable growth, the retention of its old customer base and expansion of the new customer base is very critical. This demands from a business to understand the behaviour of its customers in relation to the business. Therefore obtaining a 360° view of its customers is crucial for a business looking for a competitive edge in the market. In such a scenario, Customer Behavioural Analytics plays an important role in leveraging data analytics to find meaningful behavioural patterns in the customer-specific business data.
Consequently, this project aims to understand the consumer behaviour in the retail sector. Decoding the consumer behaviour will be based on understanding how consumers make purchase decisions and what factors influence those decisions. This project also aims to discover existence of dependencies between customers, products and shops to highlight further insights about their behaviour. These meaningful insights will further help a business to implement strategies leading to an increased revenue through customer satisfaction.
This project aims to address the problem of understanding behaviour of customers of an Italian retail distribution company Coop in a single Italian city. The project intends to discover different analytical insights about the purchase behaviour of the customers through answering different formulated Research Questions (RQ)
Supermarket aggr.Customer1
The dataset to be used is the retail market data of one of the largest Italian retail distribution company called Coop for a single Italian city.
The Supermarket aggr.Customer dataset used for the analysis contains data aggregated from customer and information from shops2 (Pennacchioli et al. 2013) and pivoted to new columns. The dataset thus contains 40 features with 60,366 instances and is approximately 14.0 MB in size.
Below are the RQs which were formulated at the initial stages of the project based on a primary understanding of the data but without a detailed Exploratory Data Analysis,
1. Are customers willing to travel long distances to purchase products in spite of the high average product price in a shop?
Relevance: This will help to understand whether the price is an important factor affecting the majority of customers purchase decisions.
2. Which are the products for which customers are ready to travel long distances and for which products they select the closest shop?
Relevance: This will help to understand the nature of the products in the context of proximity. It is assumed that customers will select closest shops to buy daily products like grocery but may travel long distances to buy one-time-purchase products like kitchen equipment. As Data Science is results-driven and not based solely on intuition, this question can help to verify this assumption.
3. What is the maximum likelihood of a customer to select a particular shop to purchase a particular product?
Relevance: This will help to understand that which shops in the retail chain are in demand for a particular product. This can further facilitate better stock management to meet customer demands.
4. What is the ranking of the shops in terms of attracting more customers and thus generating more revenue and what is their individual customer base?
Relevance: This will help to understand the most popular shops in the retail chain and target different shop-level marketing schemes to the appropriate customers through customer segmentation.
5. Which are the customers that are most profitable in terms of revenue generation?
Relevance: This will help to understand the customers with potential high Customer Lifetime Value and target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.
Data Wrangling consists of different steps transforming data from raw form into a clean form which is appropriate and accurate for data analysis. Below are the different steps which were carried out as a part of Data Wrangling,
Visualize the input dataset
library(tidyverse)
library(DataExplorer)
# Read data from the input csv file
filepath<- "Input Dataset/Supermarket aggr.Customer.csv"
supermarket_data <- read_csv(filepath)
# Converts data to tbl class. as tbl’s are easier to examine than data frames and View data set in spreadsheet-like display
supermarket_tbl<-tbl_df(supermarket_data)
# Check the dimension of the input dataset and the variables through a plot
plot_str(supermarket_tbl)
Generate the input dataset statistics
# Data Statistics
gather(introduce(supermarket_tbl))
Generate the input dataset summary
# Data Summary, p0 = min value, p100 = max value
library(skimr)
skim_with(integer = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim_with(numeric = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim(supermarket_tbl) %>% skimr::kable()
## Skim summary statistics
## n obs: 60366
## n variables: 40
##
## Variable type: integer
##
## variable missing complete n mean sd p0 p100
## ---------------------------------- --------- ---------- ------- --------- ---------- ---- -------
## customer_id 0 60366 60366 30183.5 17426.31 1 60366
## products_purchased 1 60365 60366 1778.71 2185.05 1 22131
## products_purchased_shop_1 0 60366 60366 887.81 1438.48 0 17016
## products_purchased_shop_2 0 60366 60366 605.37 1382.14 0 22110
## products_purchased_shop_3 0 60366 60366 156.68 723.82 0 16913
## products_purchased_shop_4 0 60366 60366 56.65 455.1 0 17445
## products_purchased_shop_5 0 60366 60366 72.18 504.96 0 20891
## shops_used 1 60365 60366 2.38 1.01 1 5
## unique_products_purchased 1 60365 60366 330.67 236 1 1465
## unique_products_purchased_shop_1 0 60366 60366 222.3 213.98 0 1459
## unique_products_purchased_shop_2 0 60366 60366 126.61 172.33 0 1161
## unique_products_purchased_shop_3 0 60366 60366 31.91 88.53 0 1048
## unique_products_purchased_shop_4 0 60366 60366 11.57 52.26 0 846
## unique_products_purchased_shop_5 0 60366 60366 15.77 61.93 0 841
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p100
## ------------------------- --------- ---------- ------- --------- --------- ------- ----------
## amount_purchased 1 60365 60366 4235.49 5006.78 0.21 51588.66
## amount_purchased_shop_1 0 60366 60366 2310.43 3509.99 0 47839.38
## amount_purchased_shop_2 0 60366 60366 1373.29 3104.61 0 44033.96
## amount_purchased_shop_3 0 60366 60366 303.26 1391.14 0 31246.35
## amount_purchased_shop_4 0 60366 60366 110.31 880.51 0 29628.03
## amount_purchased_shop_5 0 60366 60366 138.13 960.49 0 36819.42
## avg_distance_to_shops 1 60365 60366 2030.23 1119.91 6.64 9004.16
## avg_price 1 60365 60366 3.67 9.13 0.21 787.57
## avg_price_shop_1 0 60366 60366 4.68 15.96 0 787.57
## avg_price_shop_2 0 60366 60366 2.3 9.97 0 787.57
## avg_price_shop_3 0 60366 60366 0.76 1.9 0 263.23
## avg_price_shop_4 0 60366 60366 0.44 4.39 0 522.61
## avg_price_shop_5 0 60366 60366 0.45 1.33 0 133.97
## avg_purchase 1 60365 60366 8.53 10.32 0.21 787.57
## avg_purchase_shop_1 0 60366 60366 8.1 16.54 0 787.57
## avg_purchase_shop_2 0 60366 60366 4.97 11.23 0 787.57
## avg_purchase_shop_3 0 60366 60366 1.52 3.69 0 263.23
## avg_purchase_shop_4 0 60366 60366 0.76 4.9 0 522.61
## avg_purchase_shop_5 0 60366 60366 0.81 2.6 0 133.97
## distance_shop_1 0 60366 60366 2496.63 1281.53 93.28 8019.92
## distance_shop_2 0 60366 60366 2488.24 1417.36 11.19 9004.16
## distance_shop_3 0 60366 60366 1924.97 1157.75 17.84 7395.25
## distance_shop_4 0 60366 60366 2882.8 1730.09 6.64 9273.69
## distance_shop_5 0 60366 60366 2020.9 1260.85 25.46 7465.81
## max_distance_to_shops 1 60365 60366 2942.67 1327.53 6.64 9267.7
## min_distance_to_shops 1 60365 60366 1396.44 1048.42 6.64 9004.16
Eliminate the mising values in the input dataset
# Eliminate the missing values in the dataset
supermarket_tbl_Clean1<-na.omit(supermarket_tbl)
na.action(supermarket_tbl_Clean1)
## 44617
## 44617
## attr(,"class")
## [1] "omit"
# Percentage of data missing (Should be 0%)
missing_data_count = sum(!complete.cases(supermarket_tbl_Clean1))
total_data = dim(supermarket_tbl_Clean1)[1]
missing_data_percent = (missing_data_count/total_data) * 100
missing_data_percent
## [1] 0
Eliminate the duplicate rows in the input dataset
# Eliminate duplicate rows
distinct(supermarket_tbl_Clean1)
Round the decimal values in the input dataset
# Round the decimal value columns upto 4 decimal places
is.num <- sapply(supermarket_tbl_Clean1, is.numeric)
supermarket_tbl_Clean1[is.num] <- lapply(supermarket_tbl_Clean1[is.num], round, 4)
Rename the column names in the input dataset
# Rename column names
## From products_purchased to products_purchased_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'products_purchased'] <- 'products_purchased_total'
## From shops_used to shops_used_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'shops_used'] <- 'shops_used_total'
## From amount_purchased to amount_purchased_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'amount_purchased'] <- 'amount_purchased_total'
## From min_distance_to_shops to min_dist_to_custSel_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'min_distance_to_shops'] <- 'min_dist_to_custSel_shops'
## From max_distance_to_shops to max_dist_to_custSel_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'max_distance_to_shops'] <- 'max_dist_to_custSel_shops'
## From unique_products_purchased to unique_products_purchased_total_exclCommon
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'unique_products_purchased'] <- 'unique_products_purchased_total_exclCommon'
## From avg_distance_to_shops to avg_distance_to_all_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_distance_to_shops'] <- 'avg_distance_to_all_shops'
## From avg_price_shop_1 to avg_product_price_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_1'] <- 'avg_product_price_shop_1'
## From avg_price_shop_2 to avg_product_price_shop_2
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_2'] <- 'avg_product_price_shop_2'
## From avg_price_shop_3 to avg_product_price_shop_3
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_3'] <- 'avg_product_price_shop_3'
## From avg_price_shop_4 to avg_product_price_shop_4
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_4'] <- 'avg_product_price_shop_4'
## From avg_price_shop_5 to avg_product_price_shop_5
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_5'] <- 'avg_product_price_shop_5'
## From avg_price to avg_purchased_product_price_allShops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price'] <- 'avg_purchased_product_price_allShops'
## From avg_purchase_shop_1 to avg_purchase_amount_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_1'] <- 'avg_purchase_amount_shop_1'
## From avg_purchase_shop_2 to avg_purchase_amount_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_2'] <- 'avg_purchase_amount_shop_2'
## From avg_purchase_shop_3 to avg_purchase_amount_shop_3
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_3'] <- 'avg_purchase_amount_shop_3'
## From avg_purchase_shop_4 to avg_purchase_amount_shop_4
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_4'] <- 'avg_purchase_amount_shop_4'
## From avg_purchase_shop_5 to avg_purchase_amount_shop_5
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_5'] <- 'avg_purchase_amount_shop_5'
## From avg_purchase to avg_purchase_amount_allShops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase'] <- 'avg_purchase_amount_allShops'
Reorder the columns in the input dataset
# Reorder Columns
supermarket_tbl_Clean1 <- supermarket_tbl_Clean1[c(1,10,11,12,13,14,15,3,4,2,16,17,18,19,20,5,21,22,23,24,25,6,36,37,38,39,40,9,26,27,28,29,30,7,31,32,33,34,35,8)]
Write the cleaned dataset tbl to a CSV file
# Write the cleaned data tbl to csv
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
write.csv(supermarket_tbl_Clean1, file = clean_filepath, row.names = FALSE)
Visualize the cleaned dataset
# Check the dimension of the cleaned dataset and the variables
plot_str(supermarket_tbl_Clean1)
Generate the cleaned dataset statistics
# Cleaned data Statistics
gather(introduce(supermarket_tbl_Clean1))
Analyze the Continuous Variables in the cleaned dataset as a Histogram
# Analyze Continuous Variables in the cleaned dataset (Univariate Analysis)
plot_histogram(supermarket_tbl_Clean1)
Interpretation: The histograms plotted depict the distribution of each continuous variable in the dataset. These plots can be used to understand the data spread, whether the data is symmetric or skewed and graphically summarize the univariate dataset distribution.
Examine the corelated features in the cleaned dataset through Correlation Analysis
# Correlation analysis (Multivariate Analysis, On Continuous features only) to examine corelated features in the cleaned dataset
library(ggcorrplot)
corr <- round(cor(supermarket_tbl_Clean1), 1)
ggcorrplot(corr, outline.col = "white") +geom_tile(height=1.8, width=1.8) +
scale_fill_gradient2(low="blue", mid="white", high="red") +
theme_minimal() +
coord_equal() +
labs(x="",y="",fill="Correlation coefficient") +
theme(axis.text.x=element_text(size=7, angle=90, vjust=1, hjust=1,
margin=margin(-3,0,0,0)),
axis.text.y=element_text(size=7, margin=margin(0,-3,0,0)),
panel.grid.major=element_blank())
Interpretation: The Correlation plot depicts the association between the variables in the dataset and the degree of association between them is displayed by the variation in the correlation coefficient color. The plot helps to understand the relationship between the different variables in the dataset.
Visualize the deviation from a specific probability distribution in the cleaned dataset through Quantile-Quantile plot
# Quantile-Quantile plot to visualize the deviation from a specific probability distribution in the cleaned dataset
plot_qq(supermarket_tbl_Clean1)
Interpretation: The Quantile-Quantile plot compares two probability distributions through plotting their quantiles against each other. This plot depicts if both set of quantiles are from the same distribution (points form a roughly straight line), helping to further understand the data distribution.
Exploratory Data Analysis (EDA) is the process of visualizing the main characteristics in the data before the formal modeling on the data to discover data patterns and verify the initial primary assumptions made on the data. Below are the visualizations of the EDA carried out,
1) Visualize the highest revenue generating shops, shops selling the highest number of products, shops selling the highest number of unique products and the relation between them
library(ggplot2)
library(RColorBrewer)
library(scales)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
Shop<- c(1,2,3,4,5)
# Revenue generation by Shops 1-5
# columns selected: amount_purchased_shop_1, 2, 3, 4, 5
slice1<-select(supermarket_data_clean, 29,30,31,32,33)
amountS1<-sum(slice1$amount_purchased_shop_1)
amountS2<-sum(slice1$amount_purchased_shop_2)
amountS3<-sum(slice1$amount_purchased_shop_3)
amountS4<-sum(slice1$amount_purchased_shop_4)
amountS5<-sum(slice1$amount_purchased_shop_5)
# create data frame for Revenue generation
Revenue_Generated<- c(amountS1,amountS2,amountS3,amountS4,amountS5)
Revenue<- data.frame(Shop, Revenue_Generated)
rownames(Revenue) <- NULL
# Products Sold by Shops 1-5
# columns selected: products_purchased_shop_1, 2, 3, 4, 5
slice2<-select(supermarket_data_clean, 11,12,13,14,15)
productsS1<-sum(slice2$products_purchased_shop_1)
productsS2<-sum(slice2$products_purchased_shop_2)
productsS3<-sum(slice2$products_purchased_shop_3)
productsS4<-sum(slice2$products_purchased_shop_4)
productsS5<-sum(slice2$products_purchased_shop_5)
# create data frame for Products Sold
Products_Purchased<- c(productsS1,productsS2,productsS3,productsS4,productsS5)
ProductsSold<- data.frame(Shop, Products_Purchased)
rownames(ProductsSold) <- NULL
# Unique products Sold by Shops 1-5
# columns selected: unique_products_purchased_shop_1,2,3,4,5
slice3<-select(supermarket_data_clean, 17,18,19,20,21)
uproductsS1<-sum(slice3$unique_products_purchased_shop_1)
uproductsS2<-sum(slice3$unique_products_purchased_shop_2)
uproductsS3<-sum(slice3$unique_products_purchased_shop_3)
uproductsS4<-sum(slice3$unique_products_purchased_shop_4)
uproductsS5<-sum(slice3$unique_products_purchased_shop_5)
# create data frame for Unique products Sold
UProducts_Purchased<- c(uproductsS1,uproductsS2,uproductsS3,uproductsS4,uproductsS5)
UProductsSold<- data.frame(Shop, UProducts_Purchased)
rownames(UProductsSold) <- NULL
# Plot a Bar graph to depict the above calculated data
Legends <-c(rep("Revenue Generated", 5), rep("Products Sold", 5), rep("Unique Products Sold", 5))
values <-c(Revenue_Generated, Products_Purchased, UProducts_Purchased)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", aes(fill = Legends), position = "dodge") +
xlab("Shop") + ylab("Total") +
ggtitle("Relation between Revenue and Products Sold") +
theme_bw() + scale_y_continuous(labels = scales::comma)
Analysis: As visualized, the shops ordered based on the their highest Revenue Generated are Shop 1, 2, 3, 5, 4. The shops ordered based on the their highest amount of Products Sold are Shop 1, 2, 3, 5, 4. The shops ordered based on the their highest amount of Unique Products Sold are Shop 1, 2, 3, 4 & 5 (are on the same level). The relation between these parameters as visualized based on the shop order can be determined as the shop generating the highest revenue has the highest amount of products sold (unique included). So in the dataset, the relation between the revenue generated and the products sold is directly proportional to each other. The ordering of the shops is mostly stable here at Shop 1, 2, 3, 5, 4.
2) Visualize the approximate customer base count for the different shops
# Approximate Customer Base for Shops 1-5
C1<-slice2$products_purchased_shop_1
custS1<-length(which(C1 !=0))
C2<-slice2$products_purchased_shop_2
custS2<-length(which(C2 !=0))
C3<-slice2$products_purchased_shop_3
custS3<-length(which(C3 !=0))
C4<-slice2$products_purchased_shop_4
custS4<-length(which(C4 !=0))
C5<-slice2$products_purchased_shop_5
custS5<-length(which(C5 !=0))
# create data frame for Approximate Customer Base
Customers<- c(custS1,custS2,custS3,custS4,custS5)
TotalCustomers<- data.frame(Shop, Customers)
rownames(TotalCustomers) <- NULL
# Plot a Bar graph to depict the approximate Customer Base
values <-c(Customers)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", fill = "gray" , position = "dodge", color = "black") +
xlab("Shop") + ylab("Total") +
ggtitle("Customer Base") +
theme_bw()
Analysis: As visualized, the shops ordered based on the their highest approximate customer base are Shop 1, 2, 3, 5, 4. So the highest approximate customer base for a shop determines its popularity in terms of a customer’s product purchase from the shop. The ordering of the shops is here as Shop 1, 2, 3, 5, 4.
3) Visualize the relationship between average prices and distances to the shop
library(modelr)
library(gridExtra)
cleared_supermarket_data<-read_csv(file_path)
cleared_supermarked_tbl <- tbl_df(cleared_supermarket_data)
shop_ordered_slice <- select(cleared_supermarked_tbl, 3,23,4,24,5,25,6,26,7,27) %>%
bind_cols(cleared_supermarked_tbl[,8:10], cleared_supermarked_tbl[,28])
# Splitting the data
get_slice_for_shop <- function(col1, col2){
shop_slice <- shop_ordered_slice[,col1:col2]
colnames(shop_slice) <- c("distance","price")
return(shop_slice)
}
shop_1_data <- get_slice_for_shop(1,2)
shop_2_data <- get_slice_for_shop(3,4)
shop_3_data <- get_slice_for_shop(5,6)
shop_4_data <- get_slice_for_shop(7,8)
shop_5_data <- get_slice_for_shop(9,10)
shop_avg_data <- get_slice_for_shop(13,14)
shop_agg_min_data <- get_slice_for_shop(11,14)
shop_agg_max_data <- get_slice_for_shop(12,14)
# Combine data to the one mutated table to show all shops at the one graph
joined_shops_data <- mutate(shop_1_data, Shop="1") %>%
union_all(mutate(shop_2_data, Shop="2")) %>%
union_all(mutate(shop_3_data, Shop="3")) %>%
union_all(mutate(shop_4_data, Shop="4")) %>%
union_all(mutate(shop_5_data, Shop="5"))
# Create base for plots
get_base_for_plot <- function(dataset, caption){
plot_base <- ggplot(data = dataset, mapping = aes(x = distance, y = price)) + ggtitle(caption)
return(plot_base)
}
# Colours list
colours_shema <- c("Red", "Green", "Yellow", "Pink", "Blue", "Purple", "steelblue1", "tomato1")
#create scatter plot
add_geom_point <- function(colorNum){
geom_p <- geom_point(colour=colours_shema[colorNum], alpha=0.3)
return(geom_p)
}
#draw scatter plot with plot base
draw_cov_point_plot <- function(dataset, colorNum, caption){
cov_geom_plot <- get_base_for_plot(dataset, caption) + add_geom_point(colorNum) +
scale_y_continuous(trans="log2")+
geom_smooth(stat = 'smooth', color = 'Black', method = 'gam', formula = y ~ s(x, bs = "cs"))
return(cov_geom_plot)
}
p1_1 <- draw_cov_point_plot(shop_1_data, 1, "Shop 1")
p2_1 <- draw_cov_point_plot(shop_2_data, 2, "Shop 2")
p3_1 <- draw_cov_point_plot(shop_3_data, 3, "Shop 3")
p4_1 <- draw_cov_point_plot(shop_4_data, 4, "Shop 4")
p5_1 <- draw_cov_point_plot(shop_5_data, 5, "Shop 5")
pavg_1 <- draw_cov_point_plot(shop_avg_data, 6, "Average price with average distance")
pmin_1 <- draw_cov_point_plot(shop_agg_min_data, 7, "Average price with min distance")
pmax_1 <- draw_cov_point_plot(shop_agg_max_data, 8, "Average price with max distance")
pall_1 <- get_base_for_plot(joined_shops_data, "All shops") + geom_point(mapping = aes(colour = Shop), alpha=0.3)
comb_cov_shops <- grid.arrange(p1_1, p2_1, p3_1, p4_1, p5_1,
nrow=2, ncol=3,
top="Covariation between distances and average prices")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 4813 rows containing non-finite values (stat_smooth).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 15571 rows containing non-finite values (stat_smooth).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 39862 rows containing non-finite values (stat_smooth).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 49965 rows containing non-finite values (stat_smooth).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 47942 rows containing non-finite values (stat_smooth).
comb_cov_aggrs <- grid.arrange(pmin_1, pmax_1,
nrow=2,
top= "Covariation between min/max distances and average prices")
comb_cov_avg <- grid.arrange(pall_1, pavg_1,
nrow=2,
top= "Covariation between distances and average prices (aggregated)")
Analysis: There are strong dependencies between long average distance and the average price in a shop. Also the average price is close enough to zero, therefore it makes a sense to check for zero values for price in the current dataset and its influence.
4) Visualize data gap for average price in each shop
prices <- shop_ordered_slice[,seq(2, 10 ,2)]
names(prices) <- c("Shop 1", "Shop 2", "Shop 3", "Shop 4", "Shop 5")
dataset_with_na <- data.frame(sapply(prices, function(x) {
na_if(x,0)
} ))
pavg_2 <- plot_missing(dataset_with_na) + ggtitle("Gaps in shop prices")
Analysis: As visualized, there is a data gap for the value of average price in each shop. The value for average price in a shop for a customer is only filled in the dataset if the particular customer prefers the shop, else it is left as zero. This can be considered as a data gap but at the same time it is an information which is never utilized during analysis, meaning that the customer does not choose that particular shop in the first place. But this data gap does not affect on the tendency of the relationship between price and distance.
5) Visualize the distribution density of the average price across shops
# Normilizing of average price for common average data
dt_avg <- transform(shop_avg_data, price = pnorm(price))
pavg_3 <- ggplot(data = dt_avg, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(colour=colours_shema[6], binwidth = 0.1) + ggtitle("Average price distribution")
# Normilizing of average price for shops
dt_all <- transform(joined_shops_data, price = pnorm(price))
pall_2 <- ggplot(data = dt_all, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = Shop), binwidth = 0.1) + ggtitle("Common average price distribution")
comb_vis_distibution <- grid.arrange(pavg_3, pall_2,
nrow=2, ncol=1,
top="Distribution of average price across shops")
Analysis: As visualized, according to the data the average price is varyingly distributed in each shop.
6) Visualize patterns for the average price
joined_shops_without_null <- filter(joined_shops_data, price != 0)
mod <- lm(log(price) ~ log(distance), data = joined_shops_without_null)
joined_shops_data2 <- joined_shops_without_null %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
pall_3 <- ggplot(data = joined_shops_data2) +
geom_point(mapping = aes(x = price, y = resid), colour=colours_shema[6], alpha=0.3) + ggtitle("Average price pattern")
pall_3
pall_4 <- ggplot(data = joined_shops_data2) +
geom_boxplot(mapping = aes(x = Shop, y = resid), color=colours_shema[6]) + ggtitle("Average price pattern")
pall_4
Analysis: As visualized, the residuals gave us a view of the average price after removing the distance effect. Once the strong relationship between distance and price has been removed, relationship to other external factors become noticable.
7) Visualize the customer base for each shop based on different customer segments
## Ratio of products purchased from each shop (product purchased by a customer at a shop over total produccts purchased by that customer)
prod_purch_ratio_shop_1 <- with(supermarket_data_clean, products_purchased_shop_1/products_purchased_total)
prod_purch_ratio_shop_2 <- with(supermarket_data_clean, products_purchased_shop_2/products_purchased_total)
prod_purch_ratio_shop_3 <- with(supermarket_data_clean, products_purchased_shop_3/products_purchased_total)
prod_purch_ratio_shop_4 <- with(supermarket_data_clean, products_purchased_shop_4/products_purchased_total)
prod_purch_ratio_shop_5 <- with(supermarket_data_clean, products_purchased_shop_5/products_purchased_total)
## Ratio of amount spent at each shop (amount spent by a customer at a shop over total amount spent by that customer)
amt_purch_ratio_shop_1 <- with(supermarket_data_clean, amount_purchased_shop_1/amount_purchased_total)
amt_purch_ratio_shop_2 <- with(supermarket_data_clean, amount_purchased_shop_2/amount_purchased_total)
amt_purch_ratio_shop_3 <- with(supermarket_data_clean, amount_purchased_shop_3/amount_purchased_total)
amt_purch_ratio_shop_4 <- with(supermarket_data_clean, amount_purchased_shop_4/amount_purchased_total)
amt_purch_ratio_shop_5 <- with(supermarket_data_clean, amount_purchased_shop_5/amount_purchased_total)
## Loyality of customer to a shop (Ratio of products purchased from each shop times Ratio of amount spent at each shop)
loyality_shop_1 <- prod_purch_ratio_shop_1 * amt_purch_ratio_shop_1
loyality_shop_2 <- prod_purch_ratio_shop_2 * amt_purch_ratio_shop_2
loyality_shop_3 <- prod_purch_ratio_shop_3 * amt_purch_ratio_shop_3
loyality_shop_4 <- prod_purch_ratio_shop_4 * amt_purch_ratio_shop_4
loyality_shop_5 <- prod_purch_ratio_shop_5 * amt_purch_ratio_shop_5
## Data frames based on calculated values
prod_purch_ratio_shop <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_1, prod_purch_ratio_shop_2, prod_purch_ratio_shop_3, prod_purch_ratio_shop_4, prod_purch_ratio_shop_5)
amt_purch_ratio_shop <- data.frame(supermarket_data_clean$customer_id, amt_purch_ratio_shop_1, amt_purch_ratio_shop_2, amt_purch_ratio_shop_3, amt_purch_ratio_shop_4, amt_purch_ratio_shop_5)
loyality <- data.frame(supermarket_data_clean$customer_id, loyality_shop_1, loyality_shop_2, loyality_shop_3, loyality_shop_4, loyality_shop_5)
## Data frames as per shop nos
shop_1 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_1, amt_purch_ratio_shop_1, loyality_shop_1)
shop_2 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_2, amt_purch_ratio_shop_2, loyality_shop_2)
shop_3 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_3, amt_purch_ratio_shop_3, loyality_shop_3)
shop_4 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_4, amt_purch_ratio_shop_4, loyality_shop_4)
shop_5 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_5, amt_purch_ratio_shop_5, loyality_shop_5)
## Customer segments for Shop 1 based on Loyality Score
high_end_shop_1 <- shop_1[which(shop_1$loyality_shop_1 > 0.9), ]
mid_range_shop_1 <- shop_1[which(shop_1$loyality_shop_1 > 0.50 & shop_1$loyality_shop_1 <= 0.9), ]
low_end_shop_1 <- shop_1[which(shop_1$loyality_shop_1 > 0 & shop_1$loyality_shop_1 <= 0.5), ]
## Customer segments for Shop 2 based on Loyality Score
high_end_shop_2 <- shop_2[which(shop_2$loyality_shop_2 > 0.9), ]
mid_range_shop_2 <- shop_2[which(shop_2$loyality_shop_2 > 0.50 & shop_2$loyality_shop_2 <= 0.9), ]
low_end_shop_2 <- shop_2[which(shop_2$loyality_shop_2 > 0 & shop_2$loyality_shop_2 <= 0.5), ]
## Customer segments for Shop 3 based on Loyality Score
high_end_shop_3 <- shop_3[which(shop_3$loyality_shop_3 > 0.9), ]
mid_range_shop_3 <- shop_3[which(shop_3$loyality_shop_3 > 0.50 & shop_3$loyality_shop_3 <= 0.9), ]
low_end_shop_3 <- shop_3[which(shop_3$loyality_shop_3 > 0 & shop_3$loyality_shop_3 <= 0.5), ]
## Customer segments for Shop 4 based on Loyality Score
high_end_shop_4 <- shop_4[which(shop_4$loyality_shop_4 > 0.9), ]
mid_range_shop_4 <- shop_4[which(shop_4$loyality_shop_4 > 0.50 & shop_4$loyality_shop_4 <= 0.9), ]
low_end_shop_4 <- shop_4[which(shop_4$loyality_shop_4 > 0 & shop_4$loyality_shop_4 <= 0.5), ]
## Customer segments for Shop 5 based on Loyality Score
high_end_shop_5 <- shop_5[which(shop_5$loyality_shop_5 > 0.9), ]
mid_range_shop_5 <- shop_5[which(shop_5$loyality_shop_5 > 0.50 & shop_5$loyality_shop_5 <= 0.9), ]
low_end_shop_5 <- shop_5[which(shop_5$loyality_shop_5 > 0 & shop_5$loyality_shop_5 <= 0.5), ]
## Number of customers in each segment for Shop 1
count_high_end_shop_1 <- nrow(high_end_shop_1)
count_mid_range_shop_1 <- nrow(mid_range_shop_1)
count_low_end_shop_1 <- nrow(low_end_shop_1)
## Number of customers in each segment for Shop 2
count_high_end_shop_2 <- nrow(high_end_shop_2)
count_mid_range_shop_2 <- nrow(mid_range_shop_2)
count_low_end_shop_2 <- nrow(low_end_shop_2)
## Number of customers in each segment for Shop 3
count_high_end_shop_3 <- nrow(high_end_shop_3)
count_mid_range_shop_3 <- nrow(mid_range_shop_3)
count_low_end_shop_3 <- nrow(low_end_shop_3)
## Number of customers in each segment for Shop 4
count_high_end_shop_4 <- nrow(high_end_shop_4)
count_mid_range_shop_4 <- nrow(mid_range_shop_4)
count_low_end_shop_4 <- nrow(low_end_shop_4)
## Number of customers in each segment for Shop 5
count_high_end_shop_5 <- nrow(high_end_shop_5)
count_mid_range_shop_5 <- nrow(mid_range_shop_5)
count_low_end_shop_5 <- nrow(low_end_shop_5)
## Setting up values
shops <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)
customer_segments <- c("high", "med", "low", "high", "med", "low", "high", "med", "low", "high", "med", "low", "high", "med", "low")
count_shop_1 <- c(count_high_end_shop_1, count_mid_range_shop_1, count_low_end_shop_1)
count_shop_2 <- c(count_high_end_shop_2, count_mid_range_shop_2, count_low_end_shop_2)
count_shop_3 <- c(count_high_end_shop_3, count_mid_range_shop_3, count_low_end_shop_3)
count_shop_4 <- c(count_high_end_shop_4, count_mid_range_shop_4, count_low_end_shop_4)
count_shop_5 <- c(count_high_end_shop_5, count_mid_range_shop_5, count_low_end_shop_5)
counts <- c(count_shop_1, count_shop_2, count_shop_3, count_shop_4, count_shop_5)
## Count of customers based on the three customer segments (high, mid, low)
shop_customer_segment_count <-data.frame(shops, customer_segments, counts)
## Change datatype of columns to numerical
shop_customer_segment_count <- shop_customer_segment_count %>% mutate_if(is.integer, as.numeric) %>% mutate_if(is.factor, as.numeric)
## Plot the count of customers for each shop based on customer segments (high - 1, mid - 3, low - 2)
qplot(shops, counts, data = shop_customer_segment_count, size=counts, color = customer_segments) +
scale_size_continuous(range = c(3, 8))
## Plot the count of customers for each segment
qplot(customer_segments, counts, data = shop_customer_segment_count, size=counts, color = counts) +
scale_size_continuous(range = c(3, 8))
Analysis: As visualized, the customers can be categorized on the basis of three different customer segements i.e. high, mid and low in terms of their loyality towards the shops and promotional offers can be customised for each segment.
Exploratory Data Analysis provided a feasibility check on the Initial RQs formulated. The EDA phase helped to get a better understanding of the data in relation with the project objective. Hence this lead to the modification, removal or addition of new RQ. Below is the final set of RQ formulated which will be answered through this project,
1. Are customers willing to travel long distances to purchase products in spite of the high average product price in a shop?
Relevance: This will help to understand whether the price is an important factor affecting the majority of customers purchase decisions.
2. What are the factors that contribute towards the long distance travel of the customer to purchase products in spite of the high average product price in a shop?
Relevance: This will help to understand the important factors that contribute towards the customers willing to travel long-distances to purchase products in spite of the high average product price in a shop, in turn better understanding the purchase behaviour of the customers
3. What is the maximum likelihood of a customer to select a particular shop?
Relevance: This will help to understand that which shops in the retail chain are in demand and predict their estimated customer base. This can further facilitate better stock management to meet customer demands.
4. What are the different customer segments based on their purchase behaviour?
Relevance: This will help to understand the groups of similar customer based on their purchase behviour and target different shop-level marketing schemes to the appropriate customers.
5. Which are the Top 100 customers that are most profitable in terms of revenue generation?
Relevance: This will help to understand the customers with potential high Customer Lifetime Value and target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.
1. Are customers willing to travel long distances to purchase products in spite of the high average product price in a shop?
Algorithms selected:
Comparision of the selected algorithms:
Analysis:
2. What are the factors that contribute towards the long distance travel of the customer to purchase products in spite of the high average product price in a shop?
Algorithms selected:
Comparision of the selected algorithms:
Analysis:
3. What is the maximum likelihood of a customer to select a particular shop?
Algorithms selected:
Comparision of the selected algorithms:
Analysis:
4. What are the different customer segments based on their purchase behaviour?
Algorithms selected:
Comparision of the selected algorithms:
Analysis:
5. Which are the Top 100 customers that are most profitable in terms of revenue generation?
Algorithms selected:
Comparision of the selected algorithms:
Analysis:
GitHub Repository
https://github.com/Rspawar/Data-Science-with-R.git
Project Website
https://sites.google.com/view/customerbehaviouralanalytics
Project Screencast
Project Presentation
Complete Analysis Report as RMarkdown file
Complete Analysis Report as HTML file
December 12, 2018
Pennacchioli, Diego, Michele Coscia, Salvatore Rinzivillo, Dino Pedreschi, and Fosca Giannotti. 2013. “Explaining the Product Range Effect in Purchase Data.” In Big Data, 2013 Ieee International Conference on, 648–56. IEEE.